NSF PAR Search | NSF Public Access Repository

Spurious Rewards: Rethinking Training Signals in RLVR

Shao, Rulin; Li, Shuyue Stella; Xin, Rui; Geng, Scott; Wang, Yiping; Oh, Sewoong; Du, Simon Shaolei; Lambert, Nathan; Min, Sewon; Krishna, Ranjay; et al. (June 2025, cs.AI)

Free, publicly-accessible full text available June 12, 2026

DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al. (April 2025, https://doi.org/10.48550/arXiv.2406.11794)

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
Free, publicly-accessible full text available April 21, 2026

Material transformers: deep learning language models for generative materials design

https://doi.org/10.1088/2632-2153/acadcd

Fu, Nihang; Wei, Lai; Song, Yuqi; Li, Qinyang; Xin, Rui; Omee, Sadman Sadeed; Dong, Rongzhi; Siriwardane, Edirisuriya M; Hu, Jianjun (January 2023, Machine Learning: Science and Technology)

Transformer language models (LMs) pre-trained on large unlabeled corpora have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns for the generative design of material compositions. Here we train a series of seven modern transformer models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) for materials design using the expanded formulas of the ICSD, OQMD, and Materials Project databases. Six different datasets, with or without non-charge-neutral or electronegativity-balanced (EB) samples, are used to benchmark generative design performance and to uncover the biases of modern transformer models in the generative design of materials compositions. Our experiments show that materials transformers based on causal LMs can generate chemically valid material compositions, of which up to 97.61% are charge neutral and 91.22% are electronegativity balanced, an enrichment more than six times higher than that of the baseline pseudo-random sampling algorithm. Our LMs also demonstrate high generation novelty, and their potential for new materials discovery is demonstrated by their ability to recover held-out materials. We also find that the properties of the generated compositions can be tailored by training the models on selected training sets, such as high-bandgap samples. Our experiments further show that each model has its own preference in terms of the properties of the generated samples, and that running time varies widely across models. We have applied our materials transformers to discover a set of new materials, validated using density functional theory calculations. All our trained materials transformer models and code can be accessed freely at http://www.github.com/usccolumbia/MTransformer .
Full Text Available

TCSP: a Template-Based Crystal Structure Prediction Algorithm for Materials Discovery

https://doi.org/10.1021/acs.inorgchem.1c03879

Wei, Lai; Fu, Nihang; Siriwardane, Edirisuriya M.; Yang, Wenhui; Omee, Sadman Sadeed; Dong, Rongzhi; Xin, Rui; Hu, Jianjun (June 2022, Inorganic Chemistry)

Full Text Available

Active-Learning-Based Generative Design for the Discovery of Wide-Band-Gap Materials

https://doi.org/10.1021/acs.jpcc.1c02438

Xin, Rui; Siriwardane, Edirisuriya M.; Song, Yuqi; Zhao, Yong; Louis, Steph-Yves; Nasiri, Alireza; Hu, Jianjun (July 2021, The Journal of Physical Chemistry C)
Full Text Available
